import warnings
warnings.filterwarnings('ignore')
More on Ratio versus Interval
Data Description:
I found the reference to this data set here. Ames is a small city in the state of Iowa in the United States. It’s home to Iowa State University, which is the largest university in the state.
The Ames housing dataset examines features of houses sold in Ames during the 2006–10 timeframe in an attempt to predict the sale price of houses based on their features.
It is commonly used as an academic data set for building regression models that predict house sale prices.
Domain: Real Estate
Context: zillow.com ran a competition on Kaggle with a $1.2 million prize to accurately predict the price of real estate properties. In fact, Zillow Research provides housing data for download for citizen data scientists.
The Ames data is in the same vein and is very common in academic circles for explaining regression and non-linearity of features.
In this class we shall use the data set to explore the data and better understand how predictor variables drive or impact the target (dependent) variable, and how predictors may be related to each other.
Variable Category | Description | Column Name |
---|---|---|
Nominal | Categorical, Mutually exclusive, but not ordered, categories | MS Zoning, Street Type, Land Contour, Lot Config, Neighborhood |
Ordinal | Categorical, order matters but not the difference between values | Utilities, Land Slope, Overall Qual, Overall Cond |
Interval | Difference between two values is meaningful | Garage Yr Blt, Yr Sold |
Ratio | All the properties of an interval variable, but also has a clear definition of 0.0 | Lot Frontage, Lot Area |
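To make the interval-versus-ratio distinction concrete, here is a minimal sketch with made-up values (the numbers and variable names are illustrative and are not taken from the Ames data). Differences between years survive a shift of the zero point, whereas ratios do not; for lot areas, which have a true zero, ratios are meaningful.

```python
# Hypothetical values for illustration only
year_a, year_b = 2010, 2005        # interval scale: the zero point is arbitrary
area_a, area_b = 10000.0, 5000.0   # ratio scale: 0 really means "no area"

# Differences on an interval scale do not depend on where zero is placed
shift = 1900  # e.g. count years since 1900 instead
diff_original = year_a - year_b
diff_shifted = (year_a - shift) - (year_b - shift)

# Ratios on an interval scale change when the zero point moves
ratio_original = year_a / year_b
ratio_shifted = (year_a - shift) / (year_b - shift)

# A ratio on a ratio scale is meaningful: lot A is twice as large as lot B
area_ratio = area_a / area_b

print(diff_original, diff_shifted)
print(round(ratio_original, 4), round(ratio_shifted, 4))
print(area_ratio)
```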
# standard libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Plotting pretty figures and avoid blurry images
%config InlineBackend.figure_format = 'retina'
# Larger scale for plots in notebooks
sns.set_context('notebook')
df = pd.read_csv('https://raw.githubusercontent.com/hjhuney/Data/master/AmesHousing/train.csv', index_col=0)
For the sake of the exercise I am just going to fill the missing numeric values with the column median. We shall talk in detail about the treatment of missing values in the next class.
df.isnull().sum()[df.isnull().sum()>0]
| | Missing values (int64) |
---|---|
LotFrontage | 259 |
Alley | 1369 |
MasVnrType | 8 |
MasVnrArea | 8 |
BsmtQual | 37 |
BsmtCond | 37 |
BsmtExposure | 38 |
BsmtFinType1 | 37 |
BsmtFinType2 | 38 |
Electrical | 1 |
FireplaceQu | 690 |
GarageType | 81 |
GarageYrBlt | 81 |
GarageFinish | 81 |
GarageQual | 81 |
GarageCond | 81 |
PoolQC | 1453 |
Fence | 1179 |
MiscFeature | 1406 |
# medians exist only for numeric columns; text columns keep their missing values
df.fillna(df.median(numeric_only=True), inplace=True)
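A minimal sketch of what median filling does, on a toy frame with made-up values (the frame `toy` and its contents are assumptions for illustration; only the column names are borrowed from the data set). It shows that the numeric gap is filled while the text column keeps its missing values.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in a numeric and a text column
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "Alley": ["Grvl", None, None, "Pave"],
})

# Medians are only defined for the numeric columns
medians = toy.median(numeric_only=True)

# Passing a Series to fillna fills each column named in the Series index
filled = toy.fillna(medians)

print(filled)
print(filled.isnull().sum())
```

Text columns need a separate strategy, which is why the object columns described further down still show counts below 1460.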
The data dictionary is very detailed and can be found here. Please make sure to familiarize yourself with it.
This is tabular data. The column names follow the data definition.
df.shape
(1460, 80)
df.columns[df.dtypes=='object']
Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition'], dtype='object')
df.columns[df.dtypes=='int']
Index(['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice'], dtype='object')
df.columns[df.dtypes=='float']
Index(['LotFrontage', 'MasVnrArea', 'GarageYrBlt'], dtype='object')
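Comparing `dtypes` against `'int'` or `'float'` relies on the platform's default integer and float widths. As a hedged alternative, `select_dtypes` picks columns by dtype family; the sketch below uses a toy frame with made-up values, and the frame `toy` is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one text, one integer and one float column
toy = pd.DataFrame({
    "MSZoning": ["RL", "RM"],
    "LotArea": [8450, 9600],
    "LotFrontage": [65.0, 80.0],
})

# Select columns by dtype family rather than by an exact dtype string
int_cols = toy.select_dtypes(include=np.integer).columns.tolist()
float_cols = toy.select_dtypes(include=np.floating).columns.tolist()
text_cols = toy.select_dtypes(exclude=np.number).columns.tolist()

print(int_cols, float_cols, text_cols)
```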
df.loc[:,df.columns[df.dtypes!='object']].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
LotFrontage | 1460.0 | 69.863699 | 22.027677 | 21.0 | 60.00 | 69.0 | 79.00 | 313.0 |
LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
MasVnrArea | 1460.0 | 103.117123 | 180.731373 | 0.0 | 0.00 | 0.0 | 164.25 | 1600.0 |
BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
GarageYrBlt | 1460.0 | 1978.589041 | 23.997022 | 1900.0 | 1962.00 | 1980.0 | 2001.00 | 2010.0 |
GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
df.loc[:,df.columns[df.dtypes=='object']].describe().T
| | count | unique | top | freq |
---|---|---|---|---|
MSZoning | 1460 | 5 | RL | 1151 |
Street | 1460 | 2 | Pave | 1454 |
Alley | 91 | 2 | Grvl | 50 |
LotShape | 1460 | 4 | Reg | 925 |
LandContour | 1460 | 4 | Lvl | 1311 |
Utilities | 1460 | 2 | AllPub | 1459 |
LotConfig | 1460 | 5 | Inside | 1052 |
LandSlope | 1460 | 3 | Gtl | 1382 |
Neighborhood | 1460 | 25 | NAmes | 225 |
Condition1 | 1460 | 9 | Norm | 1260 |
Condition2 | 1460 | 8 | Norm | 1445 |
BldgType | 1460 | 5 | 1Fam | 1220 |
HouseStyle | 1460 | 8 | 1Story | 726 |
RoofStyle | 1460 | 6 | Gable | 1141 |
RoofMatl | 1460 | 8 | CompShg | 1434 |
Exterior1st | 1460 | 15 | VinylSd | 515 |
Exterior2nd | 1460 | 16 | VinylSd | 504 |
MasVnrType | 1452 | 4 | None | 864 |
ExterQual | 1460 | 4 | TA | 906 |
ExterCond | 1460 | 5 | TA | 1282 |
Foundation | 1460 | 6 | PConc | 647 |
BsmtQual | 1423 | 4 | TA | 649 |
BsmtCond | 1423 | 4 | TA | 1311 |
BsmtExposure | 1422 | 4 | No | 953 |
BsmtFinType1 | 1423 | 6 | Unf | 430 |
BsmtFinType2 | 1422 | 6 | Unf | 1256 |
Heating | 1460 | 6 | GasA | 1428 |
HeatingQC | 1460 | 5 | Ex | 741 |
CentralAir | 1460 | 2 | Y | 1365 |
Electrical | 1459 | 5 | SBrkr | 1334 |
KitchenQual | 1460 | 4 | TA | 735 |
Functional | 1460 | 7 | Typ | 1360 |
FireplaceQu | 770 | 5 | Gd | 380 |
GarageType | 1379 | 6 | Attchd | 870 |
GarageFinish | 1379 | 3 | Unf | 605 |
GarageQual | 1379 | 5 | TA | 1311 |
GarageCond | 1379 | 5 | TA | 1326 |
PavedDrive | 1460 | 3 | Y | 1340 |
PoolQC | 7 | 3 | Gd | 3 |
Fence | 281 | 4 | MnPrv | 157 |
MiscFeature | 54 | 4 | Shed | 49 |
SaleType | 1460 | 9 | WD | 1267 |
SaleCondition | 1460 | 6 | Normal | 1198 |
plt.figure(figsize=(10,7))
plt.hist(df.SalePrice, color='orange', bins=200,log=False, histtype='bar')
plt.xlabel('Sale Price $')
plt.ylabel('Count')
plt.axvline(df.SalePrice.mean(), color='red', linestyle='solid', linewidth=1.5)
plt.axvline(df.SalePrice.median(), color='brown', linestyle='solid', linewidth=1.5)
plt.axvline(df.SalePrice.describe().loc['25%'], color='brown', linestyle='dashed', linewidth=1)
plt.axvline(df.SalePrice.describe().loc['75%'], color='brown', linestyle='dashed', linewidth=1)
plt.axvline(df.SalePrice.describe().loc['mean']+ 3*df.SalePrice.describe().loc['std'], color='purple', linestyle='dashed', linewidth=1)
# plt.axvline(df.SalePrice.describe().loc['mean']- 3*df.SalePrice.describe().loc['std'], color='purple', linestyle='dashed', linewidth=1)
plt.show()
df.SalePrice.describe().reset_index().T
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
index | count | mean | std | min | 25% | 50% | 75% | max |
SalePrice | 1460 | 180921 | 79442.5 | 34900 | 129975 | 163000 | 214000 | 755000 |
plt.figure(figsize=(10,7))
# plt.hist(df.SalePrice, color='orange', bins=200,log=False, histtype='bar')
plt.xlabel('Sale Price $')
plt.ylabel('Count')
df.SalePrice[df.SalePrice < (df.SalePrice.describe().loc['mean']+ 3*df.SalePrice.describe().loc['std'])].hist(bins=200)
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e4a10a10>
Let's build an outlier flag: anything larger than mean + 3 * standard deviation is an outlier, so out_flg == True.
df['out_flg']=df.SalePrice > df.SalePrice.describe().loc['mean']+ 3*df.SalePrice.describe().loc['std']
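Here is a minimal sketch of the same three-standard-deviation rule on a toy series (the prices are made up for illustration, and `prices`, `threshold` and `flag` are hypothetical names). It uses the sample standard deviation, which is the pandas default.

```python
import pandas as pd

# Hypothetical prices: 19 ordinary sales and one very expensive house
prices = pd.Series([150_000] * 19 + [900_000], name="SalePrice")

# Upper cut-off: mean plus three sample standard deviations
threshold = prices.mean() + 3 * prices.std()

# True marks an outlier, False marks an ordinary sale
flag = prices > threshold

print(round(threshold))
print(flag.sum())
```

Note that in small samples a single extreme value inflates the standard deviation itself, so this rule can miss outliers that a median-based rule would catch.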
a4_dims = (10, 7)
# df = mylib.load_data()
fig, ax = plt.subplots(figsize=a4_dims)
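# sns.distplot is deprecated in recent seaborn; histplot with kde=True draws the same histogram plus density curve
sns.histplot(df.SalePrice, bins=200, kde=True, stat='density', color='magenta', ax=ax)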
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e4797850>
from scipy import stats
k2, p = stats.normaltest(df.loc[df.out_flg,'SalePrice'].values)
alpha = .05
print(p)
if p < alpha:  # null hypothesis: x comes from a normal distribution
    print("The null hypothesis can be rejected")
else:
    print("The null hypothesis cannot be rejected")
0.016859801739010225 The null hypothesis can be rejected
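To get a feel for how `stats.normaltest` behaves, the sketch below runs it on two synthetic samples: one drawn from a normal distribution and one from a right-skewed lognormal distribution. The samples, seed and parameters are assumptions chosen for illustration and have nothing to do with the Ames prices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic samples (illustrative parameters only)
normal_like = rng.normal(loc=180_000, scale=50_000, size=1000)
skewed = rng.lognormal(mean=12.0, sigma=0.5, size=1000)

# normaltest combines skewness and kurtosis into one test statistic
_, p_normal = stats.normaltest(normal_like)
_, p_skewed = stats.normaltest(skewed)

alpha = 0.05
print("normal sample p-value:", p_normal)
print("skewed sample p-value:", p_skewed, "-> reject normality:", p_skewed < alpha)
```

A small p-value is evidence against normality; a large one only means the test found no evidence against it.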
sorted_nb = df.groupby(['Neighborhood'])['SalePrice'].median().sort_values(ascending= False)
sorted_nb
# dimensions of the plot (width, height) in inches
a4_dims = (16, 7)
# create the figure and the axes that the plot will be drawn on
fig, ax1 = plt.subplots(figsize=a4_dims)
# rotate the neighborhood labels on ax1 (the original referenced ax, the axes of an earlier figure)
ax1.tick_params(axis='x', labelrotation=90)
### Focus on this!!!
### ax=ax1 tells seaborn to draw the plot on the axes created above
sns.boxplot(x=df['Neighborhood'], y=df['SalePrice'], order=list(sorted_nb.index), ax=ax1)
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e4534590>
sns.boxplot()
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e452edd0>
In the cheapest neighborhoods houses sell for around $100,000, and in the most expensive neighborhoods houses sell for around $300,000. For NridgHt, however, we see a large box: there is large dispersion in the distribution of prices.
Violin plots are similar to box plots, except that they also show the probability density of the data at different values, usually smoothed by a kernel density estimator.
They make it easy to spot multimodal distributions.
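The box in a box plot spans the first to the third quartile, so its height is the interquartile range (IQR). As a rough sketch, the same numbers can be computed per group with pandas; the two neighborhoods "A" and "B" and their prices below are made up for illustration.

```python
import pandas as pd

# Hypothetical prices (in thousands) for two made-up neighborhoods
toy = pd.DataFrame({
    "Neighborhood": ["A"] * 5 + ["B"] * 5,
    "SalePrice": [100, 105, 110, 115, 120, 200, 250, 300, 350, 400],
})

# Quartiles per neighborhood: one row per group, one column per quantile
q = toy.groupby("Neighborhood")["SalePrice"].quantile([0.25, 0.5, 0.75]).unstack()

# Box height = interquartile range
q["IQR"] = q[0.75] - q[0.25]

print(q)
```

Neighborhood "B" has the larger IQR, which is what a tall box communicates visually.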
sns.catplot(y="MSZoning", x="SalePrice", kind="violin", data=df, height=5, aspect=2, orient='h')
<seaborn.axisgrid.FacetGrid at 0x7f29e44cfa10>
MSZoning has a significant impact on the variability of sale price.
df.OverallCond.unique()
array([5, 8, 6, 7, 4, 2, 3, 9, 1])
#util_map= {'AllPub':5,'NoSewr':4,'NoSeWa':3,'ELO':2}
#df['util_ord']= df.Utilities.map(util_map).fillna(1)
df.OverallQual= df.OverallQual.astype(int)
df.OverallCond= df.OverallCond.astype(int)
oqc_order= np.flip(np.sort(df.OverallQual.unique(), axis=0))
np.flip(np.sort(df.OverallQual.unique(), axis=0))
array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
oqc_order
array([10, 9, 8, 7, 6, 5, 4, 3, 2, 1])
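For ordinal variables stored as text labels, the order can be attached to the data itself with an ordered categorical. This is a minimal sketch on a made-up series; the name `quality_order` and the toy values are assumptions, using the Po/Fa/TA/Gd/Ex quality labels that several columns in this data set share.

```python
import pandas as pd

# Quality labels from worst to best
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Toy series (hypothetical values)
toy = pd.Series(["TA", "Ex", "Fa", "Gd", "TA"], name="HeatingQC")

# An ordered categorical knows that Po < Fa < TA < Gd < Ex
ordered = pd.Categorical(toy, categories=quality_order, ordered=True)

# Integer codes follow the declared order and can serve as an ordinal encoding
codes = ordered.codes

print(list(codes))
print(ordered.max())
```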
sns.catplot(y="OverallQual", x="SalePrice", kind="box", data=df, height=5, aspect=2.5, orient='h',order= oqc_order)
<seaborn.axisgrid.FacetGrid at 0x7f29e4b3ad90>
SalePrice increases as quality improves; the relationship actually looks nonlinear.
The bar plot shows the mean SalePrice in each category.
hqC_cat= ['Po', 'Fa', 'TA','Gd', 'Ex']
sns.catplot(x='HeatingQC', y='SalePrice', data=df, kind='bar',order=hqC_cat, height=5, aspect=2)
<seaborn.axisgrid.FacetGrid at 0x7f29e4134a10>
cols= ['1stFlrSF','MasVnrArea','LotArea','YearBuilt']
df.loc[:,cols].hist(bins=25, figsize=(12, 12), layout=(2, 2))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f29e4060410>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f29e3fae590>], [<matplotlib.axes._subplots.AxesSubplot object at 0x7f29e3f61a10>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f29e3f17f90>]], dtype=object)
a4_dims = (10, 7)
fig, ax = plt.subplots(figsize=a4_dims)
sns.scatterplot(x='LotArea', y='SalePrice', data=df,ax=ax )
# sns.scatterplot(x='LotArea', y='SalePrice', data=df.loc[df.LotArea < 50000,:],ax=ax )
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e3d47290>
a4_dims = (10, 7)
fig, ax = plt.subplots(figsize=a4_dims)
# sns.scatterplot(x='LotArea', y='SalePrice', data=df,ax=ax )
sns.scatterplot(x='LotArea', y='SalePrice', data=df.loc[df.LotArea < 25000,:],ax=ax )
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e3cb2890>
a4_dims = (10, 7)
fig, ax = plt.subplots(figsize=a4_dims)
# sns.scatterplot(x='LotArea', y='SalePrice', data=df,ax=ax )
sns.scatterplot(x='LotArea', y='SalePrice', data=df.loc[df.LotArea < 25000,:],ax=ax, hue='BldgType' )
<matplotlib.axes._subplots.AxesSubplot at 0x7f29e3c16a10>
# jointplot is a figure-level function that creates its own figure, so no ax argument is passed
sns.jointplot(x="LotArea", y="SalePrice", kind='hist', data=df.loc[df.LotArea < 25000,:], height=7)
<seaborn.axisgrid.JointGrid at 0x7f29e3ba4dd0>
sns.jointplot(data=df.loc[df.LotArea < 25000,:], x="LotArea", y="SalePrice", hue="BldgType", height=7, kind='scatter')
<seaborn.axisgrid.JointGrid at 0x7f29e39ba890>
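.corr() and cluster heatmap
varlst= ['SalePrice', 'LotArea', 'BldgType', 'OverallQual', 'HeatingQC', 'MSZoning']
# keep only the numeric columns: correlation is not defined for text columns
df.loc[:, varlst].select_dtypes('number').corr()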
| | SalePrice | LotArea | OverallQual |
---|---|---|---|
SalePrice | 1.000000 | 0.263843 | 0.790982 |
LotArea | 0.263843 | 1.000000 | 0.105806 |
OverallQual | 0.790982 | 0.105806 | 1.000000 |
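The values in this table are Pearson correlation coefficients, the default for `.corr()`. As a small sketch on made-up numbers (the frame `toy` and its columns `x` and `y` are hypothetical), the coefficient can be computed by hand and compared with the pandas result.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): y grows almost linearly with x
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Deviations from the mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient via pandas
r_pandas = toy["x"].corr(toy["y"])

print(r_manual, r_pandas)
```

Pearson correlation only measures linear association, so a strong nonlinear relationship can still produce a modest coefficient.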
# clustermap needs a purely numeric correlation matrix
sns.clustermap(df.loc[:, varlst].select_dtypes('number').corr())
<seaborn.matrix.ClusterGrid at 0x7f29e3aaf5d0>
varlst= ['SalePrice', 'LotArea', 'BldgType', 'OverallQual', 'HeatingQC', 'MSZoning']
df.loc[3, 'SalePrice']
223500
gf=df.loc[df.LotArea < 25000,varlst]
.pairplot()
sns.pairplot(gf, corner=True, hue='BldgType', diag_kind='hist', height=4, aspect=1)
<seaborn.axisgrid.PairGrid at 0x7f29e371c8d0>
.subplots()
fig, axes = plt.subplots(3, 3,figsize=(15,15), sharey= False)
sns.barplot(ax=axes[1,2], x=gf.BldgType, y=gf.SalePrice)
axes[1,2].set_title('SalePrice vs. Bldg Type Bar')
sns.kdeplot(ax=axes[1,1], data=gf, x="LotArea", y="SalePrice", hue="MSZoning", fill=False)
axes[1,1].set_title('SalePrice vs. MSZoning Kde')
Text(0.5, 1.0, 'SalePrice vs. MSZoning Kde')
fig, axes = plt.subplots(3, 3,figsize=(15,15), sharey= False)
## 1st row
sns.barplot(ax=axes[0,0], x=gf.BldgType, y=gf.SalePrice)
axes[0,0].set_title('SalePrice vs. Bldg Type Bar')
sns.barplot(ax=axes[0,1], x=gf.HeatingQC, y=gf.SalePrice)
axes[0,1].set_title('SalePrice vs. HeatingQC')
sns.barplot(ax=axes[0,2], x=gf.MSZoning, y=gf.SalePrice)
axes[0,2].set_title('SalePrice vs. MSZoning')
## 2nd row
sns.boxplot(ax=axes[1,0], x=gf.BldgType, y=gf.SalePrice)
axes[1,0].set_title('SalePrice vs. Bldg Type Box')
sns.boxplot(ax=axes[1,1], x=gf.HeatingQC, y=gf.SalePrice)
axes[1,1].set_title('SalePrice vs. HeatingQC Box')
sns.boxplot(ax=axes[1,2], x=gf.MSZoning, y=gf.SalePrice)
axes[1,2].set_title('SalePrice vs. MSZoning Box')
## 3rd row
sns.kdeplot(ax=axes[2,0], data=gf, x="LotArea", y="SalePrice", hue="BldgType", fill=False)
axes[2,0].set_title('SalePrice vs. Bldg Type Kde')
sns.kdeplot(ax=axes[2,1], data=gf, x="LotArea", y="SalePrice", hue="HeatingQC", fill=False)
axes[2,1].set_title('SalePrice vs. HeatingQC Kde')
sns.kdeplot(ax=axes[2,2], data=gf, x="LotArea", y="SalePrice", hue="MSZoning", fill=False)
axes[2,2].set_title('SalePrice vs. MSZoning Kde')
Text(0.5, 1.0, 'SalePrice vs. MSZoning Kde')